NSF PAR Search | NSF Public Access Repository

Static Analysis for Checking the Disambiguation Robustness of Regular Expressions

https://doi.org/10.1145/3656461

Mamouras, Konstantinos; Le_Glaunec, Alexis; Li, Wu_Angela; Chattopadhyay, Agnishom (June 2024, Proceedings of the ACM on Programming Languages)

Regular expressions are commonly used for finding and extracting matches from sequence data. Due to the inherent ambiguity of regular expressions, a disambiguation policy must be considered for the match extraction problem, in order to uniquely determine the desired match out of the possibly many matches. The most common disambiguation policies are the POSIX policy and the greedy (PCRE) policy. The POSIX policy chooses the longest match out of the leftmost ones. The greedy policy chooses a leftmost match and further disambiguates using a greedy interpretation of Kleene iteration to match as many times as possible. The choice of disambiguation policy can affect the output of match extraction, which can be an issue for reusing regular expressions across regex engines. In this paper, we introduce and study the notion of disambiguation robustness for regular expressions. A regular expression is robust if its extraction semantics is indifferent to whether the POSIX or greedy disambiguation policy is chosen. This gives rise to a decision problem for regular expressions, which we prove to be PSPACE-complete. We propose a static analysis algorithm for checking the (non-)robustness of regular expressions and two performance optimizations. We have implemented the proposed algorithms and we have shown experimentally that they are practical for analyzing large datasets of regular expressions derived from various application domains.

Search for: All records